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Abstract 

We present Vector-Space Markov Random Fields (VS-MRFs), a novel class of 
undirected graphical models where each variable can belong to an arbitrary vector 
space. VS-MRFs generalize a recent line of work on scalar-valued, uni-parameter 
exponential family and mixed graphical models, thereby greatly broadening the 
class of exponential families available (e.g., allowing multinomial and Dirichlet 
distributions). Specifically, VS-MRFs are the joint graphical model distributions 
where the node-conditional distributions belong to generic exponential families 
with general vector space domains. We also present a sparsistent M-estimator for 
learning our class of MRFs that recovers the correct set of edges with high proba¬ 
bility. We validate our approach via a set of synthetic data experiments as well as 
a real-world case study of over four million foods from the popular diet tracking 
app MyFitnessPal. Our results demonstrate that our algorithm performs well em¬ 
pirically and that VS-MRFs are capable of capturing and highlighting interesting 
structure in complex, real-world data. All code for our algorithm is open source 
and publicly available. 


1 Introduction 


Undirected graphical models, also known as Markov Random Fields (MRFs), are a 
popular class of models for probability distributions over random vectors. Popular 
parametric instances include Gaussian MRFs, Ising, and Potts models, but these are all 
suited to specific data-types: Ising models for binary data, Gaussian MRFs for thin¬ 
tailed continuous data, and so on. Conversely, when there is prior knowledge of the 
graph structure but limited information otherwise, nonparametric approaches are avail¬ 
able Sudderth et al. | 2010| . A recent line of work has considered the challenge of 
specifying classes of MRFs targeted to the data-types in the given application, when 
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the structure is unknown. For the specific case of homogeneous data, where each vari¬ 
able in the random vector has the same data-type, Yang et al. 120121 proposed a general 
subclass of homogeneous MRFs. In their construction, they imposed the restriction that 
each variable conditioned on other variables belong to a shared exponential family dis¬ 
tribution, and then performed a Hammersley-Clifford-like analysis to derive the corre¬ 
sponding joint graphical model distribution, consistent with these node-conditional dis¬ 
tributions. As they showed, even classical instances belong to this sub-class of MRFs; 
for instance, with Gaussian MRFs and Ising models, the node-conditional distributions 
follow univariate Gaussian and Bernoulli distributions respectively. 


Yang et al. 120141 then proposed a class of mixed MRFs that extended this construc¬ 


tion to allow for random vectors with variables belonging to different data types, and 
allowing each node-conditional distribution to be drawn from a different univariate, 
uni-parameter exponential family member (such as a Gaussian with known variance 
or a Bernoulli distribution). This flexibility in allowing for different univariate expo¬ 
nential family distributions yielded a class of mixed MRFs over heterogeneous random 
vectors that were capable of modeling a much wider class of distributions than was 
previously feasible, opening up an entirely new suite of possible applications. 

To summarize, the state of the art can specify MRFs over heterogeneous data-typed 
random vectors, under the restriction that each variable conditioned on others belong to 
a uni-parameter, univariate exponential family distribution. But in many applications, 
such a restriction would be too onerous. For instance, a discrete random variable is 
best modeled by a categorical distribution, but this is a multi-parameter exponential 
family distribution, and does not satisfy the required restriction above. Other multi¬ 
parameter exponential family distributions popular in machine learning include gamma 
distributions with unknown shape parameter and Gaussian distributions with unknown 
variance. Another restriction above is that the variables be scalar-valued; but in many 
applications the random variables could belong to more general vector spaces, for ex¬ 
ample a Dirichlet distribution. 

As modern data modeling requirements evolve, extending MRFs beyond such re¬ 
strictive paradigms is becoming increasingly important. In this paper, we thus ex- 

2014). As opposed to other ap- 


tend the above line of work in Yang et al. 12012 


proaches which merely cluster scalar variables Vats and Moura | 2012| , we allow node¬ 
conditional distributions to belong to a generic exponential family with a general vector 
space domain. We then perform a subtler Hammersley-Clifford-like analysis to derive 
a novel class of vector-space MRFs (VS-MRFs) as joint distributions consistent with 
these node-conditional distributions. This class of VS-MRFs provides support for the 
many modelling requirements outlined above, and could thus greatly expand the po¬ 
tential applicability of MRFs to new scientific analyses. 

We also introduce an M-estimator for learning this class of VS-MRFs based on the 
sparse group lasso, and show that it is sparsistent, and that it succeeds in recovering 
the underlying edges of the graphical model. To solve the M-estimation problem, we 
also provide a scalable optimization algorithm based on Alternating Direction Method 
of Multipliers (ADMM) Boyd et al. 1 201 1) . We validate our approach empirically via 
synthetic experiments measuring performance across a variety of scenarios. We also 
demonstrate the usefulness of VS-MRFs by modeling a real-world dataset of over four 
million foods from the MyFitnessPal food database. 
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The remainder of this paper is organized as follows. SectionEmrovides background 
on mixed MRFs in the uni-parameter, univariate case. Sectionp] details our general¬ 
ization of the mixed MRF derivations to the vector-space case. Section [4] introduces 
our M-estimator and derives its sparsistency statistical guarantees. Section [5] contains 
our synthetic experiments and the MyFitnessPal case study. Finally, Sectionplpresents 
concluding remarks and potential future work. 


2 Background: Scalar Mixed Graphical Models 

Let X = (X -\, X‘ 2 - • ■ ■ X p ) be a p-dimensional random vector, where each variable X r 
has domain X r . An undirected graphical model or a Markov Random Field (MRF) is 
a family of joint distributions over the random vector X that is specified by a graph 
G = ( [V,E ), with nodes corresponding to each of the p random variables { X r jy =1 , 
and edges that specify the factorization of the joint as: 

P(X) oc n 1>c(Xc), 

ceC(G) 


where C{G) is the set of fully connected subgraphs (or cliques) of the graph G, Xc = 
{Xa} sgc denotes the subset of variables in the subset C C V, and {ipc{Xc)}ceC(G) 
are clique-wise functions, each of which is a “local function” in that it only depend on 
the variables in the corresponding clique, so that i/jc{Xc) only depends on the variable 
subset Xc- 

Gaussian MRFs, Ising MRFs, etc. make particular parametric assumptions on these 
clique-wise functions, but a key question is whether there exists a more flexible speci¬ 
fication of the form of these clique-wise functions that is targeted to the data-type and 
other characteristics of the random vector X. 

For the specific case where the variables are scalars, so that the domains X r C R, 
in a line of work, Yang et al. 12012 20141 used the following construction to de¬ 
rive a subclass of MRFs targeted to the random vector X. Suppose that for variables 
X r € X r , the following (single-parameter) univariate exponential family distribution 
P(X r ) = exp {Q r B r (X r ) + C r (X r ) — A r (9 r )}, with natural parameter scalar 0, suf¬ 
ficient statistic scalar B r (X r ), base measure C r (X r ) and log normalization constant 
A r {9), serves as a suitable statistical model. Suppose that we use these univariate 
distributions to specify conditional distributions: 


P{X r \X_ r ) =exp{ E r {X_ r )B r (X r )+ 

C r (X r )-Ar(X - r )} ’ 


( 1 ) 


where E r (-) is an arbitrary function of the rest of the variables X_ r that serves as the 
natural parameter. Would these node-conditional distributions for each node r £ V 
be consistent with some joint distribution for some specification of these functions 
{£ r (-)W? Theorem 1 from Yang et al. [2014] shows that there does exist a unique 
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joint MRF distribution with the form: 


P(X- 9) = exp OrB r (X r ) 

{rev 

+ E E 9 rt B t (X t )B r (X r ) + ... 

rev teN(r) 

k , ( 2 ) 

+ ^2 Ot 1 ...t k (X)Y\_B tj (X tj ) 

(ti ,...,t k )eC j =i 

+ E re V Cr(X r )-A(9) 1 


where A(O) is the log-normalization constant. Their proof followed an analysis simi¬ 
lar to the Hammersley-Clifford Theorem Lauritzen [ 1996) , and entailed showing that 
for a consistent joint, the only feasible conditional parameter functions E r (•) had the 
following form: 


E r (X- r ) — 9 r + 'y ' 9 r t.Bt(Xt) + ... 

t£N(r) 

k , 

+ '^2 Ort 2 ...t k (X) JJ B t . (X t .) 

t 2 ,...,t k eN(r ) j —2 


(3) 


where 9 r . := {9 ri 9 rt: ..., d rt2 ...t k } is a set of parameters, and N(r) is the set of 
neighbors of node r. 

While their construction allows the specification of targeted classes of graphical 
models for heterogeneous random vectors, the conditional distribution of each variable 
conditioned on the rest of the variables is assumed to be a single-parameter exponential 
family distribution with a scalar sufficient statistic and natural parameter. Furthermore, 
their Hammersley-Clifford type analysis and sparsistency proofs relied crucially on that 
assumption. However in the case of multi-parameter and multivariate distributions, the 
sufficient statistics are a vector; indeed the random variables need not be scalars at all 
but could belong to a more general vector space. Could one construct classes of MRFs 
for this more general, but prevalent, setting? In the next section, we answer in the 
affirmative, and present a generalization of mixed MRFs to the vector-space case, with 
support for more general exponential families. 


3 Generalization to the Vector-space Case 

Let X = (X'i , X‘ 2 - ■ • ■ X p ) be a p-dimcnsional random vector, where each variable X r 
belongs to a vector space X r . As in the scalar case, we will assume that a suitable 
statistical model for variables X r £ X r is an exponential family distribution 

m r 

P( x r ) = exp{^ 9 rj B rj (X r ) + CV( x r ) - A r {6)}, (4) 

j=i 
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with natural parameters { 9 r j}pfi , and sufficient statistics {B r j}™Z v base measure 
C r (X r ) and log normalization constant A r (9). We assume the sufficient statistics B r j : 
X r i— t R. lie in some Hilbert space FL S , and moreover specify a minimal exponential 
family so that: m r 

Y, a jB rj ( X r )^c, (5) 

j=i 

for any constant c and any vector a 0. We note that even though the variables 
{X r } could lie in general vector spaces, the exponential family distribution above is 
finite-dimensional. However, it has multiple parameters, which is the other facet that 
distinguishes it from the single-parameter univariate setting of Yang et al. |2012||20T4| . 
We defer a generalization of our framework to infinite-dimensional exponential fami¬ 
lies to future work. 

Suppose we use these general exponential family distributions to specify node¬ 
conditional distributions of variables X r conditioned on the rest of the random vari- 

" L)kV P(X r \X _ r ) = e X p{T,T=l E rj( X -r)Br j (X r ) 

( 6 ) 

-\-C r (X r ) — A r {X^ r y\ , 

where {E r j (X_ r )}J!f 1 are arbitrary functions of the rest of the variables that serve 
as natural parameters for the conditional distribution of X r . As before, we ask the 
question whether these node-conditional distributions can be consistent with some joint 
distribution for some specification of the parameter functions {E r j(X_ r )}™ Zj/, the 
following theorem addresses this very question. 


Theorem 1. Let X = (X\. X- 2 . .... X. p ) be a p-dimensional random vector with node- 
conditional distribution of each random vector X r conditioned on the rest of random 
variables as defined in & These node-conditionals are consistent with a joint MRF 
distribution over the random vector X, that is, Markov with respect to a graph G = 
( V , E ) with clique-set C, and with factors of size at most k, if and only if the functions 
{E r ()} r £v specifying the node-conditional distributions have the form: 

m t 

E r i(X- r ) =9 r i + O r i-tjB t j(X t ) + ... 

t£JV(r) j=1 


E 


E 

eJV(r) 




k 

IlB. 

3 = 2 


XXu) 


(7) 


where 9 r . = {9 r i , 9 r i-tj, is a set of parameters, nit is the dimension of the 

sufficient statistic vector for the t ,h node-conditional distribution, and N(r) is the set of 
neighbors of node r in graph G. The corresponding consistent joint MRF distribution 
has the following form: 
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P(X\6) = exp EE O ri B ri (X r ) + ... 


i —1 


^ ^ ij (-^tj ) 

ti,...,tkEC ii=l...mt 1 j =1 


( 8 ) 


+ ]T a(x r ) - a( 0) l 

rev J 

We provide a Hammersley-Clifford type analysis as proof of this theorem in the 
supplementary material, which however has subtleties not present in Yang et al. |2012[ 
2014J , due to the arbitrary vector space domain of X r , and the multiple parameters in 
the exponential families, which consequently entailed leveraging the geometry of the 
corresponding Hilbert spaces {T-L s {s € V } underlying the sufficient statistics {B S j}. 

The above Theorem|T|provides us with a general class of vector-space MRFs (VS- 
MRFs), where each variable could belong to more general vector space domains, and 
whose conditional distributions are specified by more general finite-dimensional expo¬ 
nential families. Consequently, many common distributions can be incorporated into 
VS-MRFs that were previously unsupported or lacking in Yang et al. | 2012[ 2014) . For 
instance, gamma and Gaussian nodes, though univariate, require vector-space param¬ 
eters in order to be fully modeled. Additionally, multivariate distributions that were 
impossible to use with previous methods, such as the multinomial and Dirichlet distri¬ 
butions are now also available. 

3.1 Pairwise conditional and joint distributions 

Given the form of natural parameters in (|7j, the conditional distribution of a node X r 
given all other nodes X_ r for the special case of pairwise MRFs (i.e. k = 2) has the 
form 

! m r 

E O r iB r i (-Xf*) 

m r mt 

+ E EE @ri;tj Btj (Xf ) B r i (X r ) 

t£N(r) i= 1 j =1 


+C r (X r )-A r (X_ r ,6 r 


= exp < /B r {X r ),d r + Y, 0rtB t {X t )\ 

I \ teJV(r) / 


(9) 


+C r (X r ) — A r j 9 r + 'y ' 8 rt Bt(X t ) 

\ t£N(r) 

where 6 r is a vector formed from scalars {6 rl } r ^\, 9 r t is a matrix of dimension m r xmt 
obtained from scalars 9 r ^ t j and (.,.) represents dot product between two vectors. Thus, 
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the joint distribution has the form 


P(X\6) = 

ex P < 51 \ Br(X r ), dr+ J2 9 r t B t (X t ) \ 

[rev \ teN(r) / , (10) 

+ C 'r(X r ) - A[6) 
r£V 

with the log-normalization constant A(6) = log f x exp{Y r ev(Br(X r ) , 9 r +Y t .£N(r) &rtB t {X t ))+ 
Y^rev C r (X r )}. Since A(0) is generally intractable to calculate, we next present an 
efficient approach to learning the structure of VS-MRFs. 


4 Learning VS-MRFs 


To avoid calculation of the log-normalization constant, we approximate the joint dis¬ 
tribution in ( |T()| > with the independent product of node conditionals, also known as the 
pseudo-likelihood , 

p{x\o)n l[p(x r \x_ r ,e r ,e rt ). (ii) 

r 

Let 6 r . = {0 r . () r } be the set of parameters related to the node-conditional distribution 
of node r, where 9\ r = {d r t } t£ y\ r . Since A r () is convex for all exponential families 
Wainwright and Jordan |2008|, this gives us a loss function that is convex in 6 r .\ 


e(0 r .-,v) 


“E [Br(X^),e r + OrtBtiXP) 

i \ \ t€V\r I 

- Ar le r + 0rtBt(X «) 

\ t£V\r 


( 12 ) 


We then seek to find a sparse solution in terms of both edges and individual parameters 
by employing the sparse group lasso regularization penalty Friedman et al. | 2010| , 
|Simon et al.||2013l : 


R(6 r .) = \i ]T 

t£V\r 


rt | 2 + a 2 \\e v \ 


(13) 


where ;/ r( = m r x nit is the number of parameters in the pseudo-edge from node r 
to node t (i.e., the edge (r, t) in the r th node-conditional). This yields a collection of 
independent convex optimization problems, one for each node-conditional. 

minimize t{6 r .\T>) + R(9 r .) 

We next present an approach to solving this problem based on Alternating Direction 
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Method of Multipliers (ADMM) Boyd et al. 1 201 1| . 


4.1 Optimization Procedure 

We first introduce a slack variable z into © to adhere to the canonical form of 
ADMM. For notational simplicity, we omit the data parameter V from the loss function 
and the subscripts in 0 r . and A r since it is clear we are dealing with the optimization 
of a single node-conditional. 


minimize £(0) + R(z) 
subject to 0 = z 


(15) 


where length(O) = r. The augmented Lagrangian is 

L a (0 , z, p) = £(0) + R(z) + p T (0 — z) + (a/2) \\0 - zf 2 . (16) 

Defining the residual of the slack r = 0 — z, we instead use the scaled form with 
u = (l/a)p. ADMM proceeds in an alternating fashion, performing the following 
updates at each iteration: 

0 k+1 = argmin (^£(0) + (a/2) ||# — z k + ||~^) 

z k+1 = argmin [r(z) + (a/2) || 0 k+1 - 0 + 

u k+1 =u k + 0 k+l - z k+1 


(17) 

( 18 ) 
(19) 


Updating 8 k+1 . The j th subgradient of 6 is gj(0) = —Bj + Vj A(6) + a(0j + z k — 
u k ). Note that the log-partition function, A(g), over the natural parameters, 77 = BO , is 
available in closed form for most commonly-used exponential families. Thus, V 2 A(0) 
is a weighted sum of rank-one matrices. In cases where the number of samples is much 
less than the total number of parameters (i.e. n « r), we can efficiently calculate 


an exact Newton update in 0(t) by leveraging the matrix inversion lemma Boyd and 


Vandenberghe |2009||. Otherwise, we use a diagonal approximation of the Hessian and 


perform a quasi-Newton update. 


k +1 


Updating z 

[20131 of R(z): 


We can reformulate (18 1 as the proximal operator Parikh and Boyd 


Pro x R/a (y) = argmin (r(z) + (a/ 2 ) ||z - y|| 


( 20 ) 


where y = 0 k+1 + u k . From Friedman et al. 120101, it is straightforward to show 


that the update has a closed-form solution for each j~ h block of edge parameters, 

(|5'(a(2/j),A 2 )|| 2 - ^7-\ 1 ) + S(a(y J ),\ 2 ) 


r fc+1 _ 
z j ~ 


a \\S(a(yj), A 2 )|| 2 + 1 - a) 


( 21 ) 
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where S(x, A) is the soft-thresholding operator on x with cutoff at A. 


Updating u k+1 . Per ADMM, closed-form is given in ( |19| ). 

We iterate each of the above update steps in turn until convergence, then AND 
pseudo-edges when stitching the graph back together. 


4.2 Domain constraints 

Many exponential family distributions require parameters with bounded domain. These 
bounds correspond to affine constraints on subsets of 6 in the ADMM algorithm^ 
Often these constraints are simple implicit restrictions to R + or R . In these cases 
the log-normalization function A(rj) serves as a built-in log-barrier function. For in¬ 
stance, a normal distribution with unknown mean //. and unknown variance cr 2 has 
natural parameters 771 = and 772 = — 5 ^ 2 . implying 772 < 0. However, since 

A(j]) = — — \ ln(— 2772 ), this constraint will be effectively enforced so long as we 

are given a feasible starting point for 77 . Such a feasible point can always be discovered 
using a standard phase I method Boyd and Vandenberghe [ 2009| . In the case of equal¬ 
ity requirements, such as categorical and multinomial distributions, we can directly in¬ 
corporate the constraints into the ADMM algorithm and solve an equality-constrained 
Newton’s method when updating 6. 


4.3 Sparsistency 

We next provide the mathematical conditions that ensure with high probability our 
learning procedure recovers the true graph structure underlying the joint distribution. 
Our results rely on similar sufficient conditions to those imposed in papers analyzing 
the Lasso Wainwright 120091 and the ! \ //2 penalty in Mali et al. 1 201 1| . Before stating 
the assumptions, we introduce the notation used in the proof. 


4.3.1 Notation 

Let N(r) = {t : 9* t 7 ^ 0} be the true neighbourhood of node r and let d r be the degree 
of r, i.e, d r = |AV(r)|. And S r be the index set of parameters {0*j. tk : t £ N(r)} 
and similarly be the index of parameters {6>R. tfc : t ^ N(r)}. From now on 
we will overload the notation and simply use S and S c instead of S r and ,S'j:. Let 
S { r ex) = {0; j;tk ■■ d* rrtk /0A t£ N(r)}. 

Let (}" = V 2 /'(d”.: 'D) be the sample Fischer Information matrix at node r. As 
before, we will ignore subscript r and use Q n instead of Q". Finally, we write Q.s ; s 
for the sub-matrix indexed by S. 

We use the group structured norms defined in [Jalali et al.||20Tl) in our analysis. 
The group structured norm ||u|| e a b of a vector u with respect to a set of disjoint groups 
Q = {Gi,.. ., G t } is debited as |(||u Gl || 6 , • • •, IwgtIUL- We i § nore the 8 rou P ^ 
and simply use \\u\\ ab when it is clear from the context. Similarly the group structured 

J Note that these subsets are different than the edge-wise groups that are L2 -penalized. Rather, these 
constraints apply to the sum of the i th value of each edge parameter and the i th bias weight. 
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norm \\M ||^ Q b ^ , d ^ of a matrix M pxp is defined as 


.-In 

0 


II C,d 

our analysis we always use b = 2, d = 2 and to minimize the notation we use \\M\\ 
to denote ||^f||( 0 , 2 ),(c, 2 )- And we define |Af|| moa . as max \ M iJ 

maximum of M. 


as max i.e, element wise 


4.3.2 Assumptions 

Let us begin by imposing assumptions on the sample Fisher Information matrix Q n . 
Assumption 1. Dependency condition: A miniQ'ss ) — Cmin- 
Assumption 2. Incoherence condition: 

||^s(Qss)- 1 L ,2 - f or some a G t 0 ’ 1 ] ’ where m max = max m t , 

tttrain = min m t . 

t 

Assumption 3. Boundedness: 

A max (E[B (Ayy r ) B (Ayy r ) ]) < D max < oo, where B (Xy\ r ) is a vector such 
that B (Xy\ r ) = {B t (X t )}tev\r- 

Note that the sufficient statistics {B r i(X r )}™f 1 of node r need not be bounded. 
So to analyze the M-estimation problem, we make the following assumptions on log- 
partition functions of joint and node-conditional distributions. These are similar to the 
conditions imposed for sparsistency analysis of GLMs. 

Assumption 4. The log partition function of the joint distribution satisfies the follow¬ 
ing conditions: for all r € V and i G [m r ] 

1. there exists constants k m , k v such that E[B r i(X r )\ < k m and E[B r i(X r ) 2 ] < 
k v , 

A(0) 

2. there exists constant kh such that max u -\ u |<i gff 2 {0*i + u i ) — kh, 

3. for scalar variable p , we define a function A rd as: 


A r ,iWi 9) = l°g f Xp exp {riB ri (X r ) 2 + Jfsev C s( X s ) 
+ (b s (X s ),9 s e st B t {X t )\ |d(a;) 


taN(s) 


( 22 ) 


Then, there exists a constant kh such that max u .\ u \<i d Ar A^ ,d ’-^ ( u ) < kh- 

Assumption 5. For all r € V, the log-partition function A r {.) of the node wise condi¬ 
tional distribution satisfy that there exists functions k\ ( n,p ) and fc 2 (n,p) such that for 
all feasible pairs 9 and X, || V 2 A r (a)|| < k\ ( n,p ) where a £ [6, b + 4 &2 (n,p) max {log (n) , log (p)} 1] 

for b := 9 r + “}2tev\r ®rtBt(X t ), where for vectors u and v we define [it, i?] := 

®i[ui,Vi\. Moreoever, we assume that ||V 3 A r (6)|| < k^{n,p) for all feasible 

pairs X and 9. 
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4.3.3 Sparsistency Theorem 


Given these assumptions in 4.3.2 we are now ready to state our main sparsistency result. 

Theorem 2. Consider the vector space graphical model distribution in © with true 

hold. Sup- 


parameters 9*, edge set E and vertex set V such that the assumptions 1 
pose that 9* satisfies min ||0 * t || 2 > 10 (Ai + A 2 ) and regularization parame¬ 


ters Ai, A 2 satisfy Mi ^ \A’i (n,p) \J < Ai+A 2 < M 2 ^ k x ( n,p ) k 2 ( n,p ) 
for positive constants Mi and M 2 and A 2 < ^^3 


cn+2 rrirt 


Ai. Then, there ex¬ 


ists constants L, ci, c 2 and C 3 such that if n > max{L d 2 /ci (n,p) (£3 (n,p )) 2 ( logp ') 2 log (pm: 

k^ZnTkTi^p) 2 ’ /o £ (St TO «) }’ with probability at least l-c (pT 3 (J2 t m t )~ 

exp(—c 2 n) — exp(—C 3 n), the following statements hold. 

• Tor eac/? noufe r £ V, f/ze solution of the M-estimation problem © is unique 

• Moreover, for each node r £ V the M-estimation problem recovers the true 
neighbourhood exactly. 

where m max = max mt, = min mt, p' = max[n,p). 

The proof of Theorem |2] follow^ along similar lines to the sparsistency proof in 

Yang et al.]|2014|, albeit with a subtler analysis to support general vector-spaces. It is 


2 ) 

max) 1 


based on the primal dual witness proof technique and relies on the previous results. We 
refer the interested reader to the supplementary material for additional details regarding 
the proofs. 


5 Experiments 

We demonstrate the effectiveness of our algorithm on both synthetic data and a real- 
world dataset of over four million foods logged on the popular diet app, MyFitnessPal. 

5.1 Synthetic experiments 

The synthetic experiments were run on a vector-space mixed MRF consisting of eight 
Bernoulli, eight gamma (with unknown shape and rate), eight Gaussian (with un¬ 
known mean and variance), and eight Dirichlet (k=3) nodes. The choice of these 
node-conditional distributions is meant to highlight the ability of VS-MRFs to model 
many different types of distributions. Specifically, the Bernoulli represents a univari¬ 
ate, uni-parameter distribution that would still be possible to incorporate into existing 
mixed models. The gamma and Gaussian distributions are both multi-parameter, uni¬ 
variate distributions which would have required fixing one parameter (e.g. fixing the 
Gaussians’ variances) to be compatible with previous approaches. Finally, the Dirich¬ 
let distribution is multi-parameter and multivariate, thereby making VS-MRFs truly 
unique in their ability to model this joint distribution. 

For each experiment, we conducted 30 independent trials by generating random 
weights and sampling via Gibbs sampling with a burn-in of 2000 and thinning step 
size of 10. We consider two different sparsity scenarios: high (90% edge sparsity, 50% 
intra-edge parameter sparsity) and low (50% edge sparsity, 10% intra-edge parameter 
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Figure 1: ROC curves for our synthetic experiments. The top left and bottom left plots 
show both edge as well as within-edge-parameter recovery performance respectively, 
for graphs with a high degree of sparsity. The two right plots show the same per¬ 
formance measures, but for graphs with a relatively low degree of sparsity. The low 
sparsity scenario is more challenging, requiring more data to recover the majority of 
the graph. 

sparsity). Edge recovery capability is examined by fixing \ 2 to a small value and vary¬ 
ing Ai over a grid of values in the range [0.0001,0.5]; parameter recovery is examined 
analogously by fixing Ai and varying A 2 . We use AND graph stitching and measure 
the true positive rate (TPR) and false positive rate (FPR) as the number of samples 
increases from 100 to 25K. 

Figure [5] shows the ROC curves at both the edge and parameter levels. The results 
demonstrate that our algorithm improves well as the dataset size scales. They also 
illustrate that graphs with a higher degree of sparsity are easier to recover with fewer 
samples. In both the high and low sparsity graphs, the algorithm is better able to recover 
the coarse-grained edge structure than the more fine-grained within-edge parameter 
structure, though both improve favourably with the size of the data. 

5.2 MyFitnessPal Food Dataset 

MyFitnessPa0 (MFP) is one of the largest diet-tracking apps in the world, with over 
80M users worldwide. MFP has a vast crowd-sourced database of food data, where 
each food entry contains a description, such as “Trader Joe’s Organic Carrots,” and a 
vector of sixteen macro- and micro-nutrients, such as fat and vitamin C. 

^http://myfitnesspal.com 
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[saturated fat] 

Figure 2: The top 100 edges in the MyFitnessPal food graph. Purple rectangular 
nodes correspond to macro- and micro-nutrients; green oval nodes correspond to food 
description terms. Edge color is determined by the approximate effect of the edge on 
the means of the node-conditionals: darker, blue edges represent lower means; brighter, 
orange edges represent higher means; thickness corresponds to the norm of the edge 
weight. 

We treat these foods entries as random vectors with an underlying VS-MRF dis¬ 
tribution, which we learn treating the food entries in the database as samples from the 
underlying VS-MRF distribution. The text descriptions are tokenized, resulting in a 
dictionary of approximately 2650 words; we use a Bernoulli distribution to model the 
conditional distribution of each word. The conditional distribution of each nutrient 
(on a per-calorie basis) is generally gamma distributed, but contains spikes at zercj^] 
and large outlier values)^] The gamma distribution is undefined at zero, and the outlier 
values can result in numerical instability during learning, which thus suggests using 
a distribution other than the vanilla gamma distribution. Such zero-inflated data are 
common in many biostatistics applications, and are typically modeled via a mixture 
model density of the form p(Z) = n <5o + (1 — 7r) g(z), where So is the dirac delta at 
zero, and g(z) is the density of the non-zero-valued data. Unfortunately, such mixture 
models are not generally representable as exponential families. 

To overcome this, we introduce the following class of point-inflated exponential 
family distributions. For any random variable Z € Z, consider any exponential family 
P(Z) = exp (rj T B(Z) + C{Z) — A(r])), with sufficient statistics B(-), base measure 
C(-), and log-normalization constant A(■). We consider an inflated variant of this 
random variable, inflated at some value j ; note that this could potentially lie outside the 
domain Z, in which case the domain of the inflated random variable would become ZJ 
{j}. We then define the corresponding point-inflated exponential family distribution 

3 This is common in foods since many dishes are marketed as “fat free” or contain low nutrient density 
(e.g. soda). 

4 This occurs when foods contain few calories but a large amount of some micro-nutrient (e.g. multi¬ 
vitamins) 
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as: 

Pinfi(Z) = exp {r) 0 I(Z = j) +rj[B(z) + C(z) - A infl (? 7 )} , 

where Aj n fl (//) is the log-normalization constant of the point-inflated distribution which 
can be expressed in terms of the log-normalization constant A(-) of the uninflated ex¬ 
ponential family distribution: ^4i n fl( 77 ) = log(exp{77 0 } — expj^f B(j)I(j £ Z)} + 
exp{A(? 7 i)}). Thus, as long as we have a closed form A(-) for the log-partition func¬ 
tion of the base distribution, we can efficiently calculate P m n(Z). The definition also 
permits an arbitrary number of inflated points by recursively specifying the base dis¬ 
tribution as another point-inflated model. We model each of the MFP nutrients via 
a two-point-inflated gamma distribution, with points at zero and a winsorized outlier 
bucket. 

Due to the size of the overall graph, presenting it in closer detail here is not fea¬ 
sible. To give a qualitative perspective of the relationships captured by our algorithm, 
we selected the top 100 edges in the MFP food graph by ranking the edges based 
on their L2-norm. We then calculated their approximate contribution to the mean of 
their corresponding node-conditionals to determine edge color and thickness. Figure 
[2] shows the results of this process, with edges that contribute positively colored in or¬ 
ange and edges that contribute negatively colored in blue; edge thickness corresponds 
to the magnitude of the contribution. A high-level view of the entire learned graph is 
available in the supplementary materials. 

Several interesting relationships can be discovered, even from just this small sub¬ 
set of the overall model. For instance, the negative connection between “peanut” and 
sodium may seem counter-intuitive, given the popularity of salted nuts, yet on inspec¬ 
tion of the raw database it appears that indeed many peanut-based foods are actually 
very low in sodium on a per-calorie basis. As another example, “chips” are often 
thought of as a high-carb food, but the graph suggests that they actually tend to be a 
bigger indicator of high fat. In general, we believe there is great potential for wide- 
ranging future uses of VS-MRFs in nutrition and other scientific fields, with the MFP 
case study only scratching the surface of what can be achieved. 

6 Conclusion 

We have presented vector-space MRFs as a flexible and scalable approach to modeling 
complex, heterogeneous data. In particular, we generalize the concept of mixed MRFs 
to allow for node-conditional distributions to be distributed according to a generic ex¬ 
ponential family distribution, that is potentially multi-parameter and even multivariate. 
Our VS-MRF learning algorithm has reassuring sparsistency guarantees and was vali¬ 
dated against a variety of synthetic experiments and a real-world case study. We believe 
that the broad applicability of VS-MRFs will make them a valuable addition to the sci¬ 
entific toolbox. All code for our VS-MRF implementation is publicly available^] 

“https://github.com/tansey/vsmrfs 
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A Proof of Theorem 1 


The proof follows the same lines as the proof in Yang et al. |2014|. Let us denote Q(X) 
as log (P(X)/P(0)). Note that X = (X-[ . • • • X p ) and each X r belongs to a vector 

space. Given any X, let us denote X s as X s = (X 1; • • • , JY s _ l5 0, X s+1 , • • • , X p ). 
Consider the following expansion for Q(X): 


Q(X) = 

Y l[X t ^ 0\G t (X t ) + ■ ■ ■ 


+ J2 X \ X ^ * 0, ■ • • x tk ± 0 ]G tl ... tk (X tl ... X tk ) 

(i ,-,p} 


(23) 


where X is the indicator function which takes value 1 if its argument evaluates to true 
and 0 otherwise. 

Using some simple algebra and the definition Q(X) = log {P(X)/P( 0)) we can 
show that 


exp (Q(X) ~Q(X S )) 


P(X B \X U - ,x 3 - 1 ,x a+1 ,- ,x p ) 
P(Q\X U - ,X a -x,X a+u - ,x p ) 


From (|23)> we have the following: 


(Q(X) - Q(X a )) = 

X[X S ± 0] I G S (X S ) + Y X ^ * 0]G a , t (X s , x t) 
\ tefi,->p}\* ^ 

+ ^0,--- x t„ ^0}G s , t2 ... tk (X s ,...X tk ) 

{!>••• >p}\s ) 


(24) 


(25) 


Since the node conditional distribution follows the exponential family distribution 
defined in ([6]) we can show that: 

P(X t |X 1 ,-..Y,- 1 ,A-, +1 .- ,X P ) _ 

8 P(0|X 1 ,...,X a _ 1 ,A s+1 ,...,X p ) (26) 

<■ E S (X_ S ),B S (X S ) - B s { 0)) + (C a (X s ) - CM) 

Using ( f25| and ( |26| ) for left and right hand sides of ( |24[ > and setting X t = 0 for all 
t^s we obtain: 


Z{XM$\G s {Xs) = (em,b s (x s )-bm) 

+(C S (X S )-CM) 
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Similarly setting X r = 0 for all r ^ {s, t} we obtain: 


1[X S ± 0]G s (X s ) + 1[X S ± 0,X t ± 0 ]G s ,t(X s ,X t ) = 

(E s (0 ■ • ■ X t ■ ■ ■ , 0 ),B a (X a ) - B s ( 0)) + (C a (X a )~ C a {0)) 

Similarly, replacing X s with X t in and setting X r = 0 for all r ^ {s, t} we 
obtain: 

l[X t + 0 ]G t {X t )+l[X a ± 0, X t 0 ]G s>t (X a ,X t ) = 

(E t {0 ■ ■ ■ X s ■ ■ ■ , 0), B t (X t ) - B t { 0)) + (C t (X t ) - C t (0)) 

From the above three equations we arrive at the following equality: 


{E a ( 0 • • • X t ■ ■ ■ , 0) - E a (0), B S (X S ) - B a ( 0)) = 
(E t ( 0 ■ ■ • X a ■ ■ ■ , 0) - E t (0),B t (X t ) - fl t (0)> 


(27) 


The above equality should hold for the node conditional distributions to be consistent 
with the joint MRF distribution over X with respect to graph G. So we need to find the 
form of E r () that satisfies the above equation. Omitting zero vectors for clarity from 
m we get the following: 


(E,(X,),B t (X,)) = (E,(X t ),B,( X,)) 

•£e„(x.)b„(x,) = E E sl (X t )B sl (X s ) 

j i 

We rewrite the natural parameter functions as 

E tj (X s ) = Y. e ^tjB a i{X a ) + B tj {X a ) 

l 

E sl (X t ) = tjB tj {X t )+ B al {X t ) 


(28) 


(29) 


where Vj B t j{X s ) are functions in the Hilbert space 'H s orthogonal to the span of 
functions B S (X S ), and Mj B s i(X t ) are functions in the Hilbert space TL t orthogonal to 
the span of functions B t (X t )', and 0 s i-tj, 0 s i-,tj are scalars. Combining (28 1 and (29 1 , 
we get 


EE e al . itj B al (X a )B tj (X t ) + B tj (X s )B tj (X t ) 

j i j 

= EE e sl . tj B sl (X s )B tj (X t ) + Y / B al (X t )B al (X s ) 

i j i 
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Rearranging terms in the above equation gives us the following equation: 



- 6 sl , tj )B al (X s ) + B tj (X s ) 


= J2 B si(Xs)B sl ( X t ) 
l 


B tj (X t ) 


( 31 ) 


However, since V) B s i(X t ) is orthogonal to B t (X t ), the left and right hand sides 
of the above equation are equal to 0, which leads us to the following equations. 

Y / B s i(X s )B sl {X t )=0 

l 

- 0 si-,tj)B s i{X s ) + B tj (X s ) 

j \ i 

However since we assumed that the sufficient statistics are minimal we get \/l B s i (Xt) = 
0 from the first equality and Vj, l 0 s i-tj = 0 s i;t.j , B t j(X s ) = 0 from the second equality. 

Hence from ( |29l >, we obtain E s (X t ) = 9 st (B t (X t ) — B t ( 0)) and E t (X s ) = 
9j t (B s (X s ) - H s (0)) where 0 st is a matrix formed by the scalars 0 s i-tj such that 
( @st)ij = @si]tj ^wd. 



j B t:l (X t ) = 0 


1[X S ^0,X t ^ 0]G a ,t(X a , X t ) = 

(B t (X t ) - B t (0)) T eJ t (B a (X s ) - B s ( 0)) 

By extending this argument to higher order factors we can show that the natural param¬ 
eters are required to be in the form specified by 0- 


B Proof of Sparsistency 

Before proving the sparsistency result, we will show that the sufficient statistics B r (X r ) 
are well behaved. Recall that B r i(X r ) indicates i th component of the vector B r (X r ). 
We set the convention that whenever a variable has the subscript \r attached we will 
be referring to the set of indexes { (t, j , k) : 9 r j. t k £ 9 r .,t ^ r}. 

Proposition 1. Let {X , ' :i ' 1 }" =1 have joint distribution as in then, 

(M<0 2(34 > 

for S < min {2^, kh + k v }. 
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Proof. It is clear from Taylor Series expansion and assumption[4]that 


logE exp \tB ri (X r ) 2 

log 4. ew *. e ^{ tBr< (' Yr ) 2 + 

E ( b s (x s ), e* + E e* st B t (x t )\ + 

sev \ teN(r) / 

E C s (X s ) - ,4(0*)} v (dx) 
sev J 

= Ar t i (if, 9) (t; 9*) - Ar ti ( 77 ; 9) ( 0 ; 9*) 

^ T 2 ^h 


( 35 ) 


where u £ [ 0 , 1 ] 

Therefore, by the standard Chernoff bounding technique, for t < 1, it follows that 

exp (—nSt + n k v t + i^K h nj < (^ 6 ) 

exp (r n i) 

for 6 < min { 2 ^, kh + k v }. □ 

Proposition 2. Let X be a random vector with the distribution specified in ( | !()[ >. Then, 
for any positive constant d and some constant c > 0 

p(\B n (X r )\ > 5log(p )) < cp~ 5 (37) 


Proof. Let v be a unit vector with the same dimensions as 9*. and exactly one non-zero 

entry, corresponding to the sufficient statistic B r i(X r ). Then we can write log ^E[exp(f? r i(X r ))] 

as: 

log (E[exp(E„(.Y r ))]) = A(9* +v)~ A(9*) 

By Taylor series expansion, for some u £ [0,1], we can rewrite last equation as 


A(9* + v) - A(9*) =VA(9*)-v + \v T X 2 

=E[B ri (X r )]\\v\\ 2 
1 d 2 A(9*+uv) 

+ 2 W Ti 11 


A(6* + uv)v 


v 


2 

2 
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Using Assumption]?] we get the inequality : 

A(0 + v) — A(6 ) < k m + — kh 

Now, by using Chernoff bound, for any positive constant a, we get P(B r i (X r ) > a) < 
exp (—a + k m + \kh)- By setting a = Slog(g) it follows that 

P(B ri (X r ) > 5log{rf)) < exp (-Slog(g) + k m + -k h ) < cg~ s 

where c = exp(£: m + \kh) □ 

The proof of Sparsistency is based on the primal dual witness proof technique. First 
note that the optimality condition of ( |T4| , can be written as: 

S7 £(9 r .',T>) + Ai 'y ' \fvrtZi,rt + ^ 2^2 = 0 (38) 


where Z\ rt e d || 9 r t || 2 , Z 2 G d || 9\ r ||i and we denote Z = [Z\ , Z^ , where 
Z-\ = {Zi, rt } t £v\r- And sub-gradients Z\, Z 2 should satisfy the following conditions: 

Vi (Z 2 )i 


sign y(O r )ij if {9 r )i 0 

| (Z 2 )i | < 1 otherwise 


Vi Z- 


1 ,rt 


hr if 0rt ± 0 

< 1 otherwise 


(39) 


Z \ .rt 


Note that we can think of Z\ and Z> as dual variables by appealing to Lagrangian 
theory. The next lemma shows that graph structure recovery is guaranteed if the dual 
is strictly feasible. 


Lemma 1. Suppose that there exists a primal-dual pair ^9 r .,Z^j for ( j!4| > 


such that 


n,s c 


00,2 


< 1 and 


Z- 


2 ,S C 


< 1. Then, any optimal solution 9 r . must satisfy 


^0 r .J = 0. Moreover, if the Hessian sub-matrix [V 2 f(0 r .)]ss is positive definite 

then 9\ r is the unique optimal solution. 

Proof First, note that by Cauchy—Schwarz’s and Holder’s inequalities 

{Z\,rt,O r t) <|| 9 rt ||2 and (Z 2 ,9\ r ) <|| 9\ r . (40) 
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But from ( |38j ) and the primal optimality of 9 r . and 0 r for ( p~4] ), 

£ + Yht^r X lV"rt(Zl,rt, @rt) + X 2 (Z 2 , &\ r ) 

> minf ( 9 r .) + Yht^r Ai-^Avt^i.rt, @rt) + X 2 (Z 2 , 0\ r ) 

= £ (fir ') + Ylt^r ^ly/ v rt{Z\,rt, 9rt) + X 2 (Z 2 , 9\ r ) 


( 41 ) 


— £ (d r . j + YLt^r X\yjv r i 


7\r 


hence, combining with ( [40] ) with ( [41] ) it follows that Y2tj= r At \/Vrt 


d r 


V 


— Y2tjtr My/VrtiZl ,rt,9 rt )+X 2 {Z 2 ,6\ r ). The result 


— Al \/v r t @rt 9y 

follows. 

If the Hessian sub-matrix is positive definite for the restricted problem then the 
problem is strictly convex and has a unique solution. □ 


Based on the above lemma, we prove sparsistency theorem by constructing a primal- 
dual witness (6 r ., Z) with the following steps: 


1. Set 


0r.)s — or (7771*71^(0^ gj £ {{9 r -)g | 2?) +Ai Y^it^S s/^rt II @rt II 2 T A 2 I (9 r .)s 


2. For t £ S, we define Z-\ rt = 
condition. 

3. Set {9 r .) S o = 0 

4. Set Z 2 .S'- such that Z 2 ,s c 


and then construct Z 2 .s by the stationary 


< 1 


5. Set Z 15 c such that condition ( f38] > is satisfied. 

6 . The final step consists of showing, that the following conditions are satisfied: 

(a) strict dual feasibility : the condition in Lemma|T]holds with high probabil¬ 
ity 

(b) correct neighbourhood recovery, the primal-dual pair specifies the neigh¬ 
bourhood of r, with high probability 

We begin by proving some key lemmas that are key to our main theorem. The 
sub-gradient optimality condition ([38]) can be rewritten as: 


W(0 r .; D) - W(0* ;Z>) = W n - A, ]T sfWtZy rt - X 2 Z 2 (42) 

t-.r^tt 

where W n = —V£(()*.: T>) and 9*. is the true model parameter. By applying mean- 
value theorem coordinate wise to (|42|), we get: 


*.;V)[9 r .-9*] = W n -\ 1 Y2t.. r ^V^~tZ 1 , rt 
—X 2 Z 2 + R n 


(43) 


20 
































where R n is the remainder term after applying mean-value theorem:/?™ = [V 2 f (6*. ; V) — 
\7 2 £(9 J r .; V)]J(6 r . — 9*.) for some 9f on the line between 9 r . and 9*., and with [,]J 
denoting the j-th row of matrix. The following lemma controls the score term W n 

Lemma 2. Recall v r max = maxv rt , v rm m = min u rt , p' = max(n,p). Assume that 


8(2 a) J fci (re, p) fc 4 < 

Oi y Tl l/ r ln in 

Ai + A 2 < 4(2 ~^ v/iyrm - ki(n,p) k 2 (n,p) fc 4 


(44) 


for some constant fc 4 < min {2 ^-,kh + and suppose also thatn > fe2 h log rn t ) 


then, 


p \\wr 


\ Q: y/v r min (Ai+A2)\ / 

'.2 > 2 =^- 4 - 1 < 


(45) 


\r II 0 

1 - cip ,_3 m t) - exp (—c 2 n) - exp (-c 3 n) 

Proof. Define Wf = —S7g rt £ (9*.;T>). Let Wf ]k be the element in Wf corresponding 
to parameter 9 rj;tk . Note that Wf jk = ^t,jk where 


V t \ jk =B rj 


(x®) B tk (x t W ) - 
Vo rj ., tk A r (9*+Zsev\r°*rsBs (*«)) B tk (x t W ) 


so for t' £ 



where c = 9* + Yls^r Ks B s (X l s ) + v\t'B tk (x} 1 ^ for some vi £ [0,1], There¬ 
fore, 
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2=1 


^El°g E exp(^. fc )l4v 


^ 2 V W r ,, tk Mc)B tk (X®) f' 2 
2=1 

Next lets define event e 4 = { max ? jt ||i? ( j < 41ogp'}. Then, from 

Proposition |2lwe get P(e\) < C\np'~ A (J2 t m t ) < cip'~ 3 (J2 t TO *)- If t' < (n,p), 

Assumption[5]implies that 


n 


i^logi? exp(f , ^ ife )|X« r 


< 


2 = 1 
(n, 
2 




Now, lets dehne event £■> = 


e 2 = {max^ X^"=i ^ M where k 4 


< 


min{2/c„/3, fcft +fc,,}. Then, by proposition (Ilk we obtain that if n > log(S#ev m t ) 


P (e c 2 ) < exp ( -n -^2 + log ( E m * ) ) ^ ex P (~ n ) 


\t£V 


Therefore, for t' < k 2 {n,p). 


i=l 


- E log^ ex p(fE^)l x n 


< 


k 4 (n,p)k 4 t' 


Hence, by the standard Chernoff bound technique, for t' < k 2 (n, p) 


(46) 


(47) 




i=l 


p)kjt ' 2 


2exp (ji ( kl ( n ’ 

Setting t' = ki{n s p)ki , for 5 < k 4 (n,p) k 2 (n,p) fc 4 , we aiTive to: 
p ^ E \ V ljk\ > 5 I e i>^ < 2exp ^ US 


2 k\ ( n,p)k 4 


Supposing that a ^ mi " Al+Aa 


(48) 


(49) 


4 yEi — ^i( n ’P) ki{ n iP) & 4 - It then follows that S = 


2—0! 4 y/m r m± 


Al+A2 satisfies 
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( 50 ) 


P [ 1 nilK jk \> ( ^_ 

V i=1 


Ai + A 2 


2 exp 


— a 4 y/m r m t 

n (A1+A2) 


£l,£2 < 


(2—a) 2 32 ki(n,p) k 4 m r mt 

Form which, we obtain the following using union bound 


p (\m 112 > ^ v ^( a 1+ a 2) |£ij£2 ^ < 


P \\w t 


t 11 00 ^ 2—a 4 y/m r m t 


(X y/Vr min (A1+A2) 


kl?^2 


< 


2eX P ( (23^2 3^#^! + lo 8 K*)) 


and hence, 

p(\\w 

2 exp 


,2 > 


2-a 


-a 2 ^rminn (Al + A 2 ) 2 


^ mi „^ 1 + A 2 ) | £i)£2 ^ < 

+ log (v r m ax) + logp 


(2 — a) 2 32 ki{n,p)k^ v r 

\ / 
Finally for Ai + A 2 > 8 ( 2 / \Jki(n,p) k 4 ; we obtain 


(» 


P IIW" 


3,2 > 23 


/^V min (Ai+A 2 ) 


2-a 


< 


cip' 3 (St m t) + exp (-c 2 n) + exp (-c 3 n) 


Lemma 3. Suppose that X 1 +X 2 < 40logp , Dmax Ck 3 (n, p) 
then, 


(51) 


(52) 




< 


(53) 

□ 

(A-| +A 2 ) a 


4(2- 


P (|K^.)S - (<S)s||oc ,2 < (Al + A 2 )) 

> 1 - Cp'~ 3 (St TO t) 

/or some constant c > 0 . 

Proof. We define F{us ) as: 


F(us) = t m.) s + us ; £>) - e m.)s ; 2>) 

+ Al V^Vf (ll^rt + u rt||2 - ll^rtlh) (55-) 

£G iV(r) 

+A2 (II ( 0 *.)s + «s||l - II ( 0 *.)s 111) 

From the construction of 9 r . it is clear that u s = (Q r .)s ~ (6 *.)s minimizes F. 
And since F(0) = 0, we have F (u s ) < 0. We now show that for some B > 0 with 
Moo 2 = B, we have F(us) > 0 . Using this and the fact that F is convex we can 
then show that ||ws|S 2 < B. 

Let us an arbitrary vector with Ittsloo 2 = 5 )g max (A! + A 2 ). Then, from the 
Taylor Series expansion of log likelihood function in F, we have: 
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( 56 ) 


F(u s )= S7tm.) s -V) T u s 

+zt£V 2 £ ((0* ) s + vu s ) u s 
+ Al ^2 {\\Kt + U rt\\2 ~ \\Kt\h) 

££ N(r) 

+A 2 (|| ( 0*.) s + us||i - II (0* )s II 1 ) 


for some v £ [0,1] 

We now bound each of the terms in 
Cauchy-Schwarz inequality we obtain: 


the right hand side of (56 1 . From ([5T]> and using 


W((f9*.) s ;2?f u s 

< l|V^((^.) s ;2?)|| 00 ||« s ||i 

< m m.)s^ v ) Woo dr max 11 '^'5' 11 oo,2 

— " 1 4 " “ d r W max (Ai + A 2 ) 

= tc£t dr ( Al + A2 ) 2 


(57) 


where the last inequality holds because a £ (0,1], Moreover, from triangle in 
equality we have: 

At E V / W* (\\0*t + u rth ~~ ll^rtlb) 

tes 

> -Ai E "\/ Vrt 11 U r t 11 2 

teJV(r) 

^ Al d r yjl^r max 11 11 oo,2 

- Ai (Ai + A 2 ) 


(58) 


Also, 


A 2 (II W.) s + «slli-ll W.)slli) 

> —A 2 ||ws||i 

> —A 2 jjusjli 

A A 2 d r yj V T max ||tt5||oo,2 


(59) 


5 v. 


r M d r A 2 (Ai + A 2 ) 

O'min 


On the other hand, by Taylor’s approximation of V 2 £, there exists a :j k £ [0,1] and 
between 0 \ r and 0 \ r + vus such that 

Amin (v 2 £ ((0*.)s + vu s)) 

> ^min Amin (V 2 £ ((0*.) s + /3 u s )) 

A A m i n (Qss) — 


max max 

«e[o,i]||»ll< 


■ <UE E a ^ v (y 3 A r («j fc )) 


(60) 


i=l j,k,l,t,h,s,m,sf,mf 
Urt;lh Bth ys,m,j Bsm (-^s) ^s'm' (-^v) Us',m' J' 


jkl 


Consider the event ei as defined in the previous proof. We know that P(e i) > 1 — 
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Cip 


f 3 (Et m t)- Conditioned on e-| and using Assumption[5]we arrive to the following: 


A m i„ ( V 2 £m.) s +VU S )) 

f 'inin 41ogp US 11 I D max V r m ax (A. p) 

S. f 'inin 41ogp (1 r \Jl J r max 11 tl -S' 11 oc . _2 ^max ^rmax^3 (A- J>) 


C-^mir 
— 2 


(61) 


where the last inequality holds for Ai + A 2 < 


40 logp' Dmax d T k 3 {n,p) I/2 max 

Finally using the above bounds we arrive at the following: 


F (us) > d r t'r 


a 


(At + A 2 ) z ( -1 - | + 2 ) > 0 


Therefore 


|(d;) S -(dr-)s||oo,2< (A 1 + A 2 ) 


Or 


Lemma 4. Suppose that Ai + A 2 < 


CL, 


(62) 


□ 


a (A 1 +A 1 ) 

4 (2—a) 


400 Urmaxlogp' D max k 3 (n.p) d r 


and ||W( 


then, 


P 


| ||oo ,2 < 


Ai + A 2 4 (2 — ex) 
for some constant c > 0. 


>l-cp' 3 ^ 


m t 


\ r ||oo,2 ^ 


(63) 


Proof. Recall that Rf = [V 2 £(6>* ;£>) - V 2 £(^.;P)] J (d r . - 0* ) where [ • ] J de¬ 
notes the j-th row of a matrix. Let us also refer to to the cordinate of R n corre¬ 
sponding to d r j-tk ■ Then, 


pn _ 

n Ujk - 




V 2 Ar - 


s^r 


V 2 A r + E.,^r ^ (Xj)) ® Bf (§ r . - 0* ) 


4 T 


(64) 


J 4 

with Bf the vector of sufficient statistics evluate at the i-th sample. Intoroducing 
the notation (0 r ., Bf) =: 6 r + E s ^ r ®rsB s (X]J, from the mean value theorem we 
obtain 


S7 2 A r , 3l «0* - V 2 A r;jZ = 

-4« [<0r- - dr- ,4.)] (V 3 Ar) (, B* ‘ 


(65) 


Therefore, combining ([64]) with (651 and using basic properties of krockner product we 
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ontain that 


K 


t;jk 


^ n ^ \ Rfh (-X ) W max k‘A (w P) R max || @r- 


* II 2 


i=l 


( 66 ) 


^ 4 log p V r max^-3 {P'l P) Dmax\\@r- ^ 7 -- 11 2 


which implies 


I Rt II 00,2 < 


4 y/V r max 1 °&P V r max^3 { n i P) D max IK-«< 


4 yjv r max \0gp ^rmax^3 ij^iP) D n 

A 1 +A 2 Q Vermin 
4 2—a 

with probbility at least 1 — cp '~ 3 (Xt nrit). 


25 d r v. 


rmax \ 2 

c 2 . 

min 


*i < 


(67) 


□ 


We now prove theorem]^] using lemmas 2]|4 Recalling that Q" = V 2 £(0~_: 'D) and 
the fact that we have set ( 9 r -)s c = 0 in our primal-dual construction, we can rewrite 
condition (|43| as the following equations: 


Wgc — Ai Xt^A T(r) \/ v rtZl,rt ~ ^2^2,S° + Rg;c 


(68) 


(69) 


QSsI&Os - W-)s] = a 

Wg — Ai Xte N(r) yJ v rtZ\,rt ~ ^2^2, S + R S 

Since the matrix Qgg is invertible, the conditions ( |68| l and ( [69] ) can be rewritten as 

Q's^siQss') 1 [^Xs — Aj S y^ j \[v7tZ\,rt ~~ A 2 Z 2} S + Rs] = 

£EiV(r) 

Wgc ~ Ai XtgAT(r) yJ v rtZ\,rt ~ ^2%2 ,S C + -Rgc 

Rearranging yields the following condition: 


(70) 


St^AT(r) — 

W£ + i?g c - QgcsiQss^iWs + ^5]- 

A 2 X 2 ,3= + Qscs(Qss) 1 ['^1 X^teAT(r) \/RrtZ\,rt + A2^'2,s] 


(71) 


Strict Dual Feasibility: we now show that for the dual sub-vector Z lt gc, we have 


Zl,S‘ 

ity: 


< 1. We get the following equation from 


71 


by applying triangle inequal- 
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(72) 


Al-y /7 


Ai 


Zi, s = < 

00,2 


St^iV(r) -\AVt^l.rt 

|W"L,2 + Halloo,2 
+A 2 y 7 v max 


< 


(l + llQs^sCQss) -1 !^^ 


+ ||Qs c s((3ss) 1 || 00,2 [^! + ^ 2 ) V dr ^max] 
where tVmin = minry t , i/ rmax = maxi/ rt and d r = |iV(r)| Using mutual incoher¬ 
ence bound [2] on the above equation gives us: 


Z\ ,S C 

i 

Xly/Vri 

X 


< 

00,2 

V”lloo,2 


Al yjl\- 


1 


l^lloc,2 

Q n scs(Q n ss) 1 


(2-a) 

loo,2 ^ + 1 


(73) 


Using the previous lemmas we obtain the following: 


Zi 


,s c 


— 2X7 


+ A 2 ytv^ L + _ a ) f Al + A 

X\y/ l/ r min 777 max ' ' y A2 y 


(74) 


If A 2 < 


2—a+2 — 


Ai, then. 


Z\ t s° 


00,2 


< 1 


(75) 


We have shown that the dual is strictly feasible with high probability and also the 
solution is unique. And hence based on Lemma [T] the method correctly excludes all 
edges not in the set of edges. 

Correct Neighbourhood Recovery: To show that all correct neighbours are recov¬ 
ered, it suffices to show that 


|(0*.)S - iPr- )s It oo,2 < 


where 9 min = 


min te v\r Frill 2 


Using Lemmajijwe can show the above inequality holds if 9 m i n > 10 W 1r q. ^ 2 ) 


C Full MyFitnessPal Graph 

Figure|3]shows a high-level view of the entire VS-MRF learned from the MyFitnessPal 
food database. The three macro-nutrients (fat, carbs, and protein) correspond to the 
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Figure 3: Full MRF learned from the MyFitnessPal food database. The hubs cor¬ 
respond to point-inflated gamma nutrient nodes, with the three largest hubs being the 
macro-nutrients (fat, carbs, and protein). 


three largest hubs with the remaining nine micro-nutrients representing smaller hubs. 
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